MLOps: Deploying Models into Production

Myles Mitchell @ Jumping Rivers

Welcome!

Check in using the QR code! QR code for morning session check-in

Virtual environment

URL

Password: PASSWORD

Screenshot of log in page

Before we start…

Who am I?

  • Background in Astrophysics.

  • Data Scientist @ Jumping Rivers:

    • Python & R support for various clients.

    • Teach courses in Python, R, SQL, Machine Learning.

  • Hobbies include hiking and travelling.

Jumping Rivers

↗ jumpingrivers.com   𝕏 @jumping_uk

  • Machine learning
  • Dashboard development
  • R packages and APIs
  • Data pipelines
  • Code review

Introduction to MLOps

Let’s take a step back…

The typical data science workflow:

Typical data science workflow. Starting with data importing and tidying, followed by a cycle of data transformation, data visualisation and modelling which repeats as the model is better understood. The results from this cycle are then communicated.
  • Data is imported and tidied.
  • Cycle of data transformation, data visualisation and modelling.
  • The cycle repeats as we understand the underlying model better.
  • The results are communicated to an external audience.

From Classical Stats to Machine Learning

  • The classical workflow prioritises understanding the system behind the data.
  • By contrast, Machine Learning prioritises prediction.
  • As data grows we reconsider and update our ML models to optimise predictive power.
  • A goal of MLOps is to streamline this cycle.

What is MLOps?

MLOps: Machine Learning Operations

MLOps workflow. Starting with data importing and tidying, followed by modelling and finishing with model versioning, deployment and monitoring. The cycle then repeats as more data is acquired.
  • Framework to continuously build, deploy and maintain ML models.
  • Encapsulates the “full stack” from data acquisition to model deployment.
  • Monitor models in production and detect “model drift”.
  • Versioning of models and data.

MLOps frameworks

  • Amazon SageMaker
  • Microsoft Azure
  • Google Cloud Platform
  • Vetiver by Posit


Vetiver

  • Open-source tool maintained by Posit PBC (formerly RStudio).
  • Integrates with popular ML libraries in R and Python.
  • Fluent tooling to version, deploy and monitor a trained model.
  • Supports deploying models to localhost - a great way to learn MLOps!

Your first MLOps pipeline

Let’s build an MLOps stack!

  • Data
  • Modelling
  • Deployment
  • Monitoring
  • Repeat

Data best practices

File formats

  • Consider moving from large CSV files to a more efficient format like Parquet.
  • Add a data validation check to prevent unexpected data formats entering the pipeline.
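  • A minimal validation gate might look like this (a sketch; the expected columns are borrowed from the penguins demo later in this workshop):

```r
library("palmerpenguins")

# Stop the pipeline early if the incoming data does not
# match the expected schema
validate_penguins = function(data) {
  expected_cols = c("species", "island", "flipper_length_mm", "body_mass_g")
  stopifnot(
    all(expected_cols %in% names(data)),
    is.numeric(data$flipper_length_mm),
    is.numeric(data$body_mass_g)
  )
  invisible(data)
}

validate_penguins(penguins)
```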

Data best practices

Tidying & cleaning

  • Consider creating an R package to encourage proper documentation, testing and dependency management.
  • Optimise bottlenecks to improve efficiency.
  • Split into training and validation sets.

Data best practices

Versioning

  • Ensuring reproducibility is vital.
    • Example: why did your model have poor accuracy for individuals from ethnic minority backgrounds?
  • Include timestamps in your database queries.
  • Ensure your training set can be retrieved in the future.
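  • For example, {pins} supports versioned boards, so a training set written today can be retrieved later (a sketch; the board path is illustrative):

```r
library("palmerpenguins")

# Every pin_write() to a versioned board creates a new,
# retrievable version of the training data
board = pins::board_folder("data", versioned = TRUE)
pins::pin_write(board, penguins, name = "training-data")
pins::pin_versions(board, "training-data")
```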

Data best practices

Take advantage of native MLOps tooling

  • SageMaker:
    • Data wrangler and feature store
    • Canvas for data pre-processing and visualisation

Demo

  • Using {tidyr} and the {palmerpenguins} dataset:

    library("palmerpenguins")
    
    # Remove missing values
    penguins_data = tidyr::drop_na(penguins, flipper_length_mm)
  • Add a step for data splitting.
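  • The splitting step could use {rsample} (a sketch; the 75/25 proportion and the stratification variable are assumptions):

```r
library("rsample")

# Reproducible train/validation split, stratified by species
set.seed(42)
penguins_split = initial_split(penguins_data, prop = 0.75, strata = species)
train_data = training(penguins_split)
test_data = testing(penguins_split)
```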

Task 1: Preparing your data

  • Open task1.txt

  • Give the attendees the same task as the code above (stored in demo1.txt), but with a different dataset.

  • Include a bonus question for R experts (like an extra step for feature selection).

  • Non-R experts should just try running the solution script and ask Myles questions.

  • Not an R user? The solution can be found in task1_solutions.R

  • You have just built a data validation pipeline!

10:00

Modelling best practices

Choosing the right model can be tough!

  • Use cheatsheets like [URL] to identify potential model families.
  • Include cross validation to optimise hyperparameters.
  • Try auto-ML tools like H2O.ai and SageMaker Autopilot.
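  • For example, cross validation in {tidymodels} might look like this (a sketch; the fold count and grid size are assumptions):

```r
library("tidymodels")

# 5-fold cross validation to tune the number of neighbours
set.seed(42)
folds = vfold_cv(penguins_data, v = 5)
knn_wflow = workflow(
  recipe(species ~ island + flipper_length_mm + body_mass_g,
         data = penguins_data),
  nearest_neighbor(mode = "classification", neighbors = tune())
)
knn_results = tune_grid(knn_wflow, resamples = folds, grid = 10)
show_best(knn_results, metric = "accuracy")
```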

Modelling best practices

Versioning

  • Store scoring metrics and model parameters from each experiment.
  • Any previously-deployed model should be retrievable, along with the data used to train it.
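  • One lightweight approach is to log each experiment to a versioned {pins} board (a sketch; the board path, accuracy value and pin name are illustrative):

```r
# Log each experiment's metrics so any deployed model can be
# traced back to its score and training run
board = pins::board_folder("experiments", versioned = TRUE)
metrics = data.frame(
  model = "k-nn",
  accuracy = 0.95,       # score from the current experiment
  trained_at = Sys.time()
)
pins::pin_write(board, metrics, name = "experiment-log")
```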

Demo

  • We’ll consider a basic nearest neighbour model for this workshop.

  • Let’s predict penguin species using island, flipper_length_mm and body_mass_g

library("ggplot2")
ggplot(penguins_data, aes(flipper_length_mm, body_mass_g)) +
  geom_point(aes(colour = species, shape = island)) +
  theme_minimal() +
  xlab("Flipper Length (mm)") +
  ylab("Body Mass (g)") +
  viridis::scale_colour_viridis(discrete = TRUE)

Demo

  • Let’s set up the model recipe in {tidymodels}:
library("tidymodels")

model = recipe(
  species ~ island + flipper_length_mm + body_mass_g, 
  data = penguins_data
) |>
  workflow(nearest_neighbor(mode = "classification")) |> 
  fit(penguins_data)

Demo

Our model object can now be used to predict species:

model_pred = predict(model, penguins_data)
mean(
  model_pred$.pred_class == as.character(
    penguins_data$species
  )
)
  • Our model is 95% accurate (on the data used to train it…)

Enter Vetiver!

  • Use a Vetiver model object to collate all of the necessary info needed to store, deploy and version our model:

    v_model = vetiver::vetiver_model(
      model,
      model_name = "k-nn",
      description = "blog-test"
    )
    v_model

Vetiver model

v_model is a list with six elements

  • View the contents:

    names(v_model)
  • View the model description:

    v_model$description

Vetiver model

  • View the metadata:

    v_model$metadata

Model storage

  • Use {pins} to store R or Python objects for reuse later.

  • Store pins using “boards”, including Azure, Amazon S3 or even Google Drive!

  • Vetiver integrates nicely with Posit Connect:

    vetiver::vetiver_pin_write(board = pins::board_connect(), v_model)

Task 2: Creating a Vetiver model

  • Open task2.txt

  • Run your solution to task 1 to prepare the data

  • Reproduce the same modelling code as above, but with a different model and the same dataset as in Task 1.

  • Again, provide solution script for non-R people, and bonus question for experts.

  • Your pipeline now includes model training and scoring!
10:00

Deployment best practices

  • Try deploying locally to check that your model API works as expected.

  • Use environment managers like {renv} to store model dependencies.

  • Use containers like Docker to bundle model source code with dependencies.

Deployment using Vetiver

  • We deploy models as APIs which take input data and send back model predictions.

  • We can use a {plumber} API to deploy a {vetiver} model.

Deploying locally

  • {vetiver} and {plumber} support local deployment.

    plumber::pr() |>
      vetiver::vetiver_api(v_model) |>
      plumber::pr_run()
  • Great for beginners to MLOps!

Deploying locally

Check the deployment with:

base_url = "http://127.0.0.1:7764/"
url = paste0(base_url, "ping")
r = httr::GET(url)
metadata = httr::content(r, as = "text", encoding = "UTF-8")
jsonlite::fromJSON(metadata)

Model predictions

Let’s check that our API works!

  • Endpoints metadata and predict allow programmatic queries:

    url = paste0(base_url, "predict")
    endpoint = vetiver::vetiver_endpoint(url)
    pred_data = penguins_data |>
      dplyr::select(
        "island", "flipper_length_mm", "body_mass_g"
      ) |>
      dplyr::slice_sample(n = 10)
    predict(endpoint, pred_data)
  • Our model gives us predictions!

Task 3: Deploying your model

  • Open task3.txt

  • Repeat local deployment and prediction for model created in Task 2.

  • Provide solution for non-R people.

  • Make sure demo3.R contains code from demo above.

  • Also check that local deployment / prediction works in Posit Workbench.

  • Your model is in production! But we’re not finished yet…
10:00

Deploying to the cloud

  • Vetiver also streamlines deployment to the production environment:

    vetiver::vetiver_prepare_docker(
      pins::board_connect(), 
      "colin/k-nn", 
      docker_args = list(port = 8080)
    )
  • This command:

    • Lists R dependencies with {renv}

    • Stores the {plumber} API code in plumber.R

    • Generates a Dockerfile

Docker files

Our Dockerfile contains a series of commands to:

  • Set the R version and install the system libraries.

  • Install the required R packages.

  • Run the API in the deployed environment.

Running Docker

  • Build a Docker container:

    docker build --tag my-first-model .
  • Inspect your stored Docker images:

    docker image list
  • Run the image:

    docker run --rm --publish 8080:8080 my-first-model
  • These steps can be run in sequence using a Makefile.

Deploying to Connect

  • Vetiver integrates nicely with Posit Connect:

    vetiver::vetiver_deploy_rsconnect(
      board = pins::board_connect(), "colin/k-nn"
    )
  • We can also publish to Amazon SageMaker using vetiver_deploy_sagemaker()

Cost considerations for cloud MLOps

  • AWS offers a 2-month free trial for Amazon SageMaker.

  • Azure Machine Learning is offered at no extra charge to existing Azure customers.

  • Costs can rise depending on computational resources consumed.

  • Model building and deployment use different environments.

Monitoring your model

Deployment is just the beginning…

Why should I care?

  • Our model may perform well with current data.

    • But data evolves…
  • As data and user base grows, your model must scale.

Some case studies…

Data evolution

  • The underlying distributions of data can and will change:

    • The balance of penguin species will change with global warming.
  • The intrinsic relationship between the target variable and predictors can change:

    • Gentoo penguin body mass could change over time due to availability of food.

Model drift

  • As the data changes, our model predictions start to drift.

  • Identifying model drift is vital to any MLOps workflow.

  • Retrain the model with the latest data and redeploy.

Task 4: Model monitoring

  • How might we identify model drift?

  • Discuss

05:00

Model monitoring

Some best practices:

  • As users query the model API, store the model predictions.

  • A shift in the distribution of the model predictions is a classic sign of drift.

  • As our data grows, run checks of the underlying distributions.

  • When drift is detected, retrain with the latest data and redeploy.
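  • A simple drift check compares the class distribution of recent predictions against a baseline window (a sketch; the counts are made up for illustration):

```r
# Hypothetical prediction counts from two time windows
baseline = c(Adelie = 150, Chinstrap = 70, Gentoo = 120)
recent   = c(Adelie = 90,  Chinstrap = 60, Gentoo = 180)

# A chi-squared test flags a shift in the distribution of
# predicted classes; a small p-value suggests drift
chisq.test(rbind(baseline, recent))
```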

Task 5: Detecting model drift

  • Open task4.txt

  • lemurs_new.csv contains the latest version of the data.

  • Run your prediction code on the new data and check for signs of drift.

  • You have now built a complete MLOps pipeline!
10:00

Advantages of MLOps

  • Retraining and redeployment can happen at the click of a button.

  • Encourages good practices like model versioning and packaging of source code.

  • Reduces human error.

Thanks for listening!